Dyr og Data

Data visualisation — groups, facets, stats

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-12

Introduction

In this section, we’ll look at

  • groups
  • facets and the concept of small multiples
  • stats

Data

In this video we’ll use the penguins data set from the palmerpenguins 📦

We’ll also make use of the GB bovine TB data set

library("palmerpenguins")
library("ggplot2")
library("dplyr")
library("readxl")

bovine <- read_xlsx("data/bovine-tb/gb-tb-stats.xlsx") |>
  mutate(date = as.Date(date), year = format(date, "%Y"),
    doy = as.numeric(format(date, "%j"))) |>
  rename(n_cases = n_not_otf)

Some defaults

# labels
cases_lab <- "Number of cases"
doy_lab <- "Day of year"
tb_labs <- labs(x = doy_lab, y = cases_lab)

penguin_labs <- labs(x = "Bill length (mm)", y = "Flipper length (mm)")

# base plot
plt <- bovine |> ggplot(aes(x = doy, y = n_cases))

What went wrong here?

bovine |> filter(country == "England") |>
ggplot(aes(x = doy, y = n_cases)) + geom_line() + tb_labs

Structure in the data

bovine |> filter(country == "England") |>
ggplot(aes(x = doy, y = n_cases)) + geom_line(aes(group = year)) + tb_labs

The group aesthetic

Groups are usually formed when a discrete variable is assigned to a channel such as colour or shape

The group aesthetic is by default set to the interaction of all discrete variables in the plot

Set the group aesthetic when structure in the data isn’t already mapped to an aesthetic or the default is insufficient

Groups can also be formed from facets

But group is for everything else
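
To see how the group aesthetic partitions the data, here is a minimal sketch using a small hypothetical data frame (the values in toy are made up for illustration). layer_data() returns the data a layer was drawn from, including the group each row was assigned to:

```r
library("ggplot2")

# Hypothetical data: two years with three observations each
toy <- data.frame(
  doy     = rep(c(1, 100, 200), times = 2),
  n_cases = c(5, 8, 6, 7, 9, 4),
  year    = rep(c("2019", "2020"), each = 3)
)

# No discrete variable is mapped, so all rows end up in a single group
p_default <- ggplot(toy, aes(x = doy, y = n_cases)) + geom_line()

# Mapping year to group gives one group (and so one line) per year
p_grouped <- ggplot(toy, aes(x = doy, y = n_cases, group = year)) + geom_line()

# Inspect the group identifiers computed for each plot
unique(layer_data(p_default)$group)
unique(layer_data(p_grouped)$group)
```

The first plot joins all six points with one line, which is the problem seen in the England plot; the second draws a separate line for each year.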

Combining group with other aesthetics

bovine |>
ggplot(
  aes(x = doy, y = n_cases, colour = country, group = interaction(country, year))) +
  geom_line() + tb_labs

Small multiples

A small multiple is a series of similar plots using the same scale and axes

Multiple plots show different partitions of the data

In ggplot, small multiple plots are created by faceting

  • facet_wrap()
    • One or more categorical variables used to partition the data
    • Individual plots are arranged in sequence over a number of rows and columns
  • facet_grid()
    • Data are partitioned by two sets of variables & arranged into a grid
    • One set of variables forms partitions for the rows
    • Second set of variables forms partitions for the columns

facet_wrap()

The partition is specified using a formula: ~ f1 + f2

Use the nrow and ncol arguments to set the required dimensions

plt + geom_line(aes(group = year)) +
  facet_wrap(~ country, nrow = 1) + tb_labs

Most commonly used with a single partitioning variable
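
Although a single variable is most common, the ~ f1 + f2 form works too. The sketch below is an illustration rather than part of the original slides: it wraps the penguins data by species and island. Only combinations that occur in the data get a panel, and ncol fixes the number of columns.

```r
library("ggplot2")
library("palmerpenguins")

# One panel per observed combination of species and island, in two columns
p_wrap <- ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ species + island, ncol = 2)

# Number of panels that were created
n_panels <- length(unique(layer_data(p_wrap)$PANEL))
n_panels
```

This differs from facet_grid(), which lays out every crossing of the row and column variables, including the empty ones.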

facet_wrap()

Sometimes we want to give each data set its own axes

Use the scales argument to facet_wrap()

  • scales = "free_y" separate y-axis scales
  • scales = "free_x" separate x-axis scales
  • scales = "free" both x- and y-axis scales are separate

facet_wrap() — separate scales

plt +
  geom_line(colour = "grey70", aes(group = year)) +
  geom_smooth(linewidth = 1.1, method = "loess", se = FALSE) +
  facet_wrap(~ country, nrow = 1, scales = "free_y") +
  tb_labs

facet_wrap() — alternate

breaks <- c(1, 10, 25, 50, 100, 250, 500, 1000, 2000, 3000)
plt +
  geom_line(colour = "grey70", aes(group = year)) +
  geom_smooth(linewidth = 1.1, method = "loess", se = FALSE) +
  facet_wrap(~ country, nrow = 1) +
  tb_labs +
  scale_y_log10(breaks = breaks, labels = scales::comma(breaks))

facet_grid()

Partition is specified using a formula: f1 ~ f2

ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  facet_grid(island ~ species) +
  penguin_labs

Faceting time series

Use a . for an “empty” margin:

plt + geom_line(aes(group = year)) + tb_labs +
  facet_grid(country ~ ., scales = "free_y")

Small multiples

Highlight a particular year

plt +
  geom_line(colour = "grey70", aes(group = year)) +
  geom_line(data = bovine |> filter(year == "2020"),
    linewidth = 1.1, colour = "red") +
  facet_wrap(~ country, nrow = 1) + tb_labs +
  scale_y_log10(breaks = breaks, labels = scales::comma(breaks))

Stats

Some geoms plot the data directly, other geoms apply a statistical transformation to the data before plotting

The transformation is done by a stat_xxx() function, or stat for short

  • Each geom has a default stat

  • Each stat has a default geom

Stats — what happened here?

penguins |>
ggplot(aes(x = island)) + geom_bar()

stat_count()

The default stat for geom_bar() is stat_count()

It counts the number of observations in each group

Stats create temporary variables that we can use — this is where count came from

Temporary variables are named ..name..

stat_count() creates:

  • ..count..

  • ..prop..
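
One way to look at the variables a stat computes is layer_data(). This is a sketch for illustration: the columns count and prop in the result are the values stat_count() produced for each bar.

```r
library("ggplot2")
library("palmerpenguins")

p_bar <- ggplot(penguins, aes(x = island)) + geom_bar()

# One row per bar, with the variables computed by stat_count()
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]

# stat_count() with its default geom computes the same counts
p_stat <- ggplot(penguins, aes(x = island)) + stat_count()
stat_data <- layer_data(p_stat)
```

Because geom_bar() and stat_count() are each other's defaults, the two plots should show the same bars.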

after_stat()

The ..name.. notation still works but is deprecated (since ggplot2 3.4.0); ggplot2 provides helper functions that control when an aesthetic is evaluated

  • after_stat()
  • after_scale()
  • stage()

So we could use after_stat(prop) to access the proportions

Stats — what happened here?

penguins |>
ggplot(aes(x = species)) + geom_bar(mapping = aes(y = after_stat(prop)))

Stats — grouping

Override the grouping by setting it to group = 1

penguins |>
ggplot(aes(x = species)) + geom_bar(mapping = aes(y = after_stat(prop), group = 1))
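
Inspecting the computed values shows why group = 1 matters. In this sketch, the default grouping puts each species in its own group, so every proportion is computed within a single bar; with group = 1 the proportions are computed across all bars.

```r
library("ggplot2")
library("palmerpenguins")

# Default grouping: one group per species
p_default <- ggplot(penguins, aes(x = species)) +
  geom_bar(aes(y = after_stat(prop)))

# A single group: proportions are relative to the whole data set
p_one_group <- ggplot(penguins, aes(x = species)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

prop_default   <- layer_data(p_default)[, c("x", "group", "prop")]
prop_one_group <- layer_data(p_one_group)[, c("x", "group", "prop")]

prop_default
prop_one_group
```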

Stats

Many statistical charts you might know by name involve a statistical transformation

  • bar charts
  • histograms
  • boxplots

Bar charts

If you have the summary data you want to plot as a bar chart, use geom_col()

bovine |> filter(year == "2020") |> group_by(country) |>
  summarise(total_cases = sum(n_cases)) |>
ggplot(aes(y = total_cases, x = country)) +
  geom_col() +
  labs(y = "Total cases", x = NULL, title = "Cases of bovine TB in 2020")

Histograms

Histograms chop the data into segments known as bins

Observations within each bin are counted and possibly converted to a density

A histogram is a series of bars showing the count or density in each bin
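
The binning step can be sketched by hand in base R. The flipper lengths below are hypothetical values made up for illustration, and the bin edges are an arbitrary choice; ggplot2's own binning may place the edges differently.

```r
# Hypothetical flipper lengths (mm), made up for illustration
flipper <- c(181, 186, 195, 193, 190, 210, 215, 222, 198, 230)

# Chop the range into bins that are 10 mm wide
breaks <- seq(180, 230, by = 10)
bin <- cut(flipper, breaks = breaks, include.lowest = TRUE)

# Count the observations that fall in each bin
counts <- as.vector(table(bin))

# Convert counts to densities so that the bar areas sum to 1
dens <- counts / (sum(counts) * diff(breaks))

data.frame(bin = levels(bin), count = counts, density = dens)
```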

geom_histogram()

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()

Histograms

The default number of bins (30) is arbitrary, but the choice changes how we view the distribution of the data values

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram(bins = 5)

Histograms

Some “rules” of thumb can suggest an optimal number of bins, e.g. Sturges’ rule

# drop missing values so they are not counted in the sample size
nbin <- with(penguins, nclass.Sturges(na.omit(flipper_length_mm)))
penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram(bins = nbin)
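
Sturges' rule itself is a one-line formula: the number of bins is ceiling(log2(n) + 1) for n observations. The sketch below checks that against nclass.Sturges(); the sample size n is an assumed value chosen for illustration.

```r
# Assumed sample size, chosen for illustration
n <- 342

# Sturges' rule computed by hand
by_hand <- ceiling(log2(n) + 1)

# nclass.Sturges() applies the same formula to the length of its input,
# so any vector of length n gives the same answer
from_r <- nclass.Sturges(numeric(n))

c(by_hand = by_hand, from_r = from_r)
```

Note that the rule depends only on the number of values supplied, not on the values themselves.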

Histograms

Grouping by a discrete variable results in stacked histograms — hard to interpret

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(bins = 20, alpha = 0.4)

Frequency polygons

An alternative in such cases is a frequency polygon: a line joins the counts at the midpoints of the bins

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, colour = species)) +
  geom_freqpoly(bins = 20)

Frequency polygons

Use after_stat() to draw the density of the data

Handy if groups have very different counts

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, y = after_stat(density), colour = species)) +
  geom_freqpoly(bins = 20)

Density plots

Density plots are a smooth form of histogram

Density is estimated via a kernel density estimator

geom_density()

Default base geom is geom_area()

fill and colour aesthetics

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_density()
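
A kernel density estimate can also be computed directly with base R's density(). This sketch assumes that is close to what the stat behind geom_density() does with its defaults, and uses hypothetical flipper lengths made up for illustration.

```r
# Hypothetical flipper lengths (mm), made up for illustration
flipper <- c(181, 186, 195, 193, 190, 210, 215, 222, 198, 230)

# Gaussian kernel density estimate on a regular grid of x values
kde <- density(flipper, kernel = "gaussian")

# Bandwidth chosen by the default rule
kde$bw

# The area under the estimated curve should be close to 1
area <- sum(kde$y) * diff(kde$x[1:2])
area
```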

Density plots

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, fill = species)) +
  geom_density(alpha = 0.4)

Density plots — alternative

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, colour = species)) +
  geom_line(stat = "density")

Boxplots

A boxplot is formed from Tukey’s five number summary

  • minimum,
  • maximum,
  • median (the middle value)
  • lower hinge (~ the lower quartile)
  • upper hinge (~ the upper quartile)

plus some additional values computed from these

  • whiskers: extend from each hinge to the most extreme data point within 1.5 × the inter-hinge distance (IQR), or to the data min/max if no points lie beyond that
  • outliers: points that lie more than 1.5 × the inter-hinge distance beyond a hinge

IQR = interquartile range
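
These quantities can be computed in base R with fivenum() and boxplot.stats(). The sketch below uses hypothetical flipper lengths, with one deliberately extreme value. ggplot2 computes the hinges as quartiles, so its values can differ slightly from Tukey's hinges.

```r
# Hypothetical flipper lengths (mm); 260 is a deliberately extreme value
flipper <- c(172, 181, 186, 190, 193, 195, 198, 201, 210, 260)

# Tukey's five number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(flipper)

# Inter-hinge distance and the limits beyond which points are outliers
ihd <- five[4] - five[2]
limits <- c(five[2] - 1.5 * ihd, five[4] + 1.5 * ihd)

# boxplot.stats() returns the whisker ends, hinges and median in $stats
# and any outlying points in $out
bs <- boxplot.stats(flipper)

five
limits
bs$stats
bs$out
```

Here the upper whisker stops at 210, the largest value inside the limits, and 260 is flagged as an outlier.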

Boxplots

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(y = flipper_length_mm, x = species, colour = species)) +
  geom_boxplot() +
  labs(y = "Flipper length (mm)", x = NULL)

Boxplots

Flip x and y:

penguins |> filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, y = species, colour = species)) +
  geom_boxplot() +
  labs(x = "Flipper length (mm)", y = NULL, colour = "Species")